Recently long range correlations were detected in nucleotide sequences
and in human writings by several authors. We undertake here a systematic
investigation of two books, Moby Dick by H. Melville and Grimm's
tales, with respect to the existence of long range correlations. The analysis
is based on the calculation of entropy-like quantities such as the mutual
information for pairs of letters and the entropy, the mean uncertainty,
per letter. We further estimate the number of different subwords of a
given length n. Filtering out the contributions due to the effects of the
finite length of the texts, we find correlations ranging to a few hundred
letters. Scaling laws for the mutual information (decay with a power
law), for the entropy per letter (decay with the inverse square root of
n) and for the word numbers (stretched exponential growth with n and
with a power law of the text length) were found.

From a formal point of view a book may be considered as a linear string of letters.
In this respect there exists a similarity to other linear structures [1]. The main
information carriers in living systems are sequences of amino acids and/or nucleotides;
other examples are pieces of music recorded on tapes or on paper, computer
programs written on disks or tapes etc. By using the methods of symbolic dynamics
any trajectory of a dynamic system (deterministic or stochastic) may be mapped to
a string of letters on a certain alphabet. Usually the strings generated by dynamical
systems show only short range correlations, except under critical conditions where,
in analogy to equilibrium phase transitions, correlations on all scales may be
observed. Recently long range correlations were detected in DNA sequences
and in human writings. The intrinsic difficulties connected with the analysis of
long range correlations in DNA led to a controversial discussion about the authentic
character of long range structures in DNA.

Naturally the question arises whether long range correlations may be found in
other information carrying strings too. This work is devoted to the investigation of
long range correlations in texts. We use the methods of entropy analysis, which were
first applied to texts by Claude Shannon in 1951 [7]. For several reasons we expect
the existence of long range structures in these sequences. Since a book is written in
a unique style and according to a general plan of the author, we expect correlations
which range from the beginning of a text up to the end. Already Shannon
wrote: "From this analysis it appears that, in ordinary literary English, the long
range statistical effects (up to 100 letters) reduce the entropy...".

Another strong argument for long correlations is based on the combinatorial
explosion. Uncorrelated sequences generated on an alphabet of A letters have a
manifold of $N(n) = A^n = \exp[n \cdot \ln(A)]$ different subwords of length n. A subword
(block) is here any combination of letters, including the space, punctuation marks
and numbers. For n > 100 the number N(n) is extremely large. In other words we
need very sharp restrictions to select a meaningful subset. Long range correlations
provide such a strong selection criterion. Hence we must expect that only a very
small subset N*(n) of the possible words appears in a text. Bounds of this kind are
given by the rules of writing texts, i.e. by the rules of syntax as well as by semantic
relations. These rules do not allow for an arbitrary concatenation of letters to words
and of words to sentences but lead to a limitation of the growth of the number of
allowed letter combinations with n. The problem we address here is whether the
function N*(n) follows a simple scaling law.
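The contrast between the combinatorial maximum and the subwords that actually occur can be illustrated with a short sketch. The function names below are hypothetical and not taken from the paper, and the count is the raw number of distinct subwords in a finite sample, without any of the length corrections discussed later.

```python
def count_distinct_subwords(text, n):
    """Number of different subwords (blocks) of length n that occur in text."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})


def max_subwords(alphabet_size, n):
    """Combinatorial maximum A**n for an uncorrelated sequence."""
    return alphabet_size ** n


sample = "abracadabra"
print(count_distinct_subwords(sample, 2), "of", max_subwords(5, 2), "possible pairs")
```

For a real text the observed count is additionally bounded by the number of subword positions, L + 1 - n, so for large n it says more about the sample length than about the language.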

In earlier papers the conjecture has been made that the number of allowed subwords
scales according to a stretched exponential law

$$N^*(n) \sim \exp[c\, n^{\alpha}] \quad \text{with } \alpha < 1/2,\ c = \text{const}. \qquad (1)$$

The scaling rule (1) reduces the number of the allowed subwords
drastically ($N^*(n) \ll A^n$) for large n. In order to describe a given string of
length L using an alphabet of A letters we introduce the following notations:
Let $A_1 A_2 \dots A_n$ be the letters of a given substring of length n < L. Let further
$p^{(n)}(A_1 \dots A_n)$ be the probability for this substring (block). A special case is the
probability $p^{(n)}(A_1, A_n)$ to find a pair with (n - 2) arbitrary letters in between. Then
we may introduce the mutual information for two letters in distance n (also called
transinformation) [10, 11]

$$I(n) = \sum_{i,j} p^{(n)}(A_i, A_j) \log \frac{p^{(n)}(A_i, A_j)}{p^{(1)}(A_i)\, p^{(1)}(A_j)}. \qquad (2)$$



The mutual information is a special measure for correlations which is closely related
to the autocorrelation function [11]-[12]. Further we define the entropy per block
of length n [13]

$$H_n = -\sum p^{(n)}(A_1 \dots A_n) \log p^{(n)}(A_1 \dots A_n). \qquad (3)$$

The block entropy is related to the mean number of words by

$$N^*(n) \sim A^{H_n}. \qquad (4)$$

As shown already by Shannon, the entropy per letter of blocks of length n, $H_n/n$, is
an important quantity expressing the structure of sequences. In earlier work we assumed the
following scaling behaviour for a definite class of strings at large n-values

$$H_n/n = h + g \cdot n^{\mu_0 - 1} + e/n, \qquad 0 < \mu_0 < 1. \qquad (5)$$

Here h, the limit of the mean uncertainty, is called the entropy of the source. This
quantity is positive for stochastic as well as for chaotic processes; g and e are
constants. If h, e > 0 and g = 0 the correlations in the string are short range,
corresponding to a Markov process with a finite memory [13]. For periodic strings one
finds h = g = 0, e > 0. The existence of a long range order in strings may be
characterized by the condition g > 0, describing a slowly decaying contribution to the
asymptotics of the entropy per letter for large n. Of special interest for the further
consideration of texts is the case h = 0, g > 0, corresponding to a power law tail of
the entropy decaying slower than 1/n. A working hypothesis developed earlier
is that this is the typical behaviour for texts. In other words long texts are strings
on the borderline between periodicity and chaos, showing long range correlations.

The mutual information (transinformation) is not a monotonic function of n. We
define long range effects by power law tails of the averaged mutual information I(n).
Here the averaging is carried out over a window comprising several of the typical
oscillations (fluctuations). Several authors have demonstrated that DNA sequences
show slowly decaying fluctuations at large scales. As mentioned already,
for DNA some evidence for the existence of long range correlations was found.



We will apply the methods of entropy analysis to literary English represented
by the books "Moby Dick" by Melville (L ≈ 1,170,200) and Grimm's Tales (L ≈
1,435,800). Let us just mention that pieces of music may be treated in a similar
way. For simplification we use an alphabet consisting of 32 symbols: the small
letters a b c ... x y z, the marks , . ( ) # and the space; # stands for any
number. In order to get better statistics we have used for the entropy calculations
also a restricted alphabet consisting of only 3 letters O, M, L. The letter O codes here
for vowels, the letter M stands for consonants and the letter L stands for spaces and
marks.
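The two encodings might be sketched as follows. This is only an illustration under stated assumptions: the text does not say how characters outside the 32 symbols are treated (here they become spaces), whether a multi-digit number yields one # (here it does), or whether y counts as a vowel (here it is a consonant). All names are hypothetical.

```python
import re

LETTERS = "abcdefghijklmnopqrstuvwxyz"
MARKS = ",.()#"
ALPHABET_32 = LETTERS + MARKS + " "   # 26 letters + 5 marks + space = 32 symbols
VOWELS = set("aeiou")                  # assumption: 'y' is treated as a consonant


def encode_32(text):
    """Map raw text onto the 32-symbol alphabet."""
    text = text.lower()
    text = re.sub(r"\d+", "#", text)   # assumption: each run of digits becomes one '#'
    # assumption: every character outside the alphabet is replaced by a space
    return "".join(ch if ch in ALPHABET_32 else " " for ch in text)


def encode_3(text32):
    """Map a 32-symbol string onto O (vowel), M (consonant), L (space or mark)."""
    out = []
    for ch in text32:
        if ch in VOWELS:
            out.append("O")
        elif ch in LETTERS:
            out.append("M")
        else:
            out.append("L")
    return "".join(out)


print(encode_3(encode_32("Call me Ishmael.")))
```

The reduced alphabet keeps the sequence length unchanged, so block statistics for A = 3 and A = 32 refer to the same positions in the text.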

The calculation of the mutual information requires counting frequencies of pairs
of letters at distance n. Since the number of different pairs is $32^2 = 1024$ we have
good statistics for our books. The function I(n) is a measure for the correlations
of letters at distance n. Every peak at n corresponds to a positive correlation.
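The pair counting can be sketched as a simple plug-in estimate. The sketch assumes that "distance n" means an offset of n positions, takes the single-letter probabilities from the marginals of the counted pairs, and reports the result in bits. None of these choices is fixed by the text, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(text, n):
    """Plug-in estimate (in bits) of the mutual information of letter pairs
    separated by an offset of n positions."""
    pairs = Counter(zip(text, text[n:]))
    total = sum(pairs.values())
    if total == 0:
        raise ValueError("text is too short for this distance")
    first = Counter()
    second = Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count
    info = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / total
        # p_ab / (p_a * p_b), written with the raw counts
        info += p_ab * log2(count * total / (first[a] * second[b]))
    return info


print(mutual_information("ab" * 1000, 1))  # close to 1 bit for a period-2 string
```

Such a raw estimate is biased upwards for finite samples, which is why values below the fluctuation level have to be discarded.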

In Fig. 1 we show the mutual information calculated for Moby Dick and for
Grimm's Tales (A = 32). The results show well expressed correlations in the range
n = 1 ... 25 which are followed by a long slowly decaying tail. The obtained values
for the transinformation I(n) become meaningless if they are smaller than the level
of the fluctuations which are due to the finite length L of the text. Herzel et al.
[10, 14] give an estimate for the level of these fluctuations.
For our rather long texts with $L > 10^6$ the fluctuation level has a value of about $10^{-4}$.
The smoothed values for the mutual information for the range n = 25 ... 1000 may be
fitted by the scaling law $I(n) = c_1 \cdot n^{-0.37} + c_2$ with $c_1 = 1.5 \cdot 10^{-4}$, $c_2 = 1.1 \cdot 10^{-4}$. The
constant $c_2$ corresponds here to the level of fluctuations. Our results show that long
texts exhibit pair correlations which decay, at least up to distances of several hundred
letters, according to a power law. However, due to the greater uniformity of texts
these correlation tails are not as strong as observed for DNA sequences [14].



For the calculation of entropies we must count the frequencies of subwords, where
a subword of length n is defined as any combination of n letters. The result of counting
the words consisting of n = 4, 9, 16, 25 letters in Grimm's Tales is shown in
Fig. 2 in a rank ordered representation. The structure of the rank ordered distributions
is rather similar for both texts; however, the list of words is of course quite
different. For example, among the most frequent subwords of length n = 25 are in
the case of Moby Dick "_greenland_or_right_whale" and "_species_of_the_leviathan".
For Grimm's Tales rather frequent subwords are e.g. "_if_i_could_but_shudder._" and
"princess,_open_the_door_f".
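A rank ordered list of subwords can be produced with a sliding window and a counter. The sketch below uses a hypothetical helper name and assumes overlapping windows, i.e. every position of the text starts a subword.

```python
from collections import Counter


def rank_ordered_subwords(text, n):
    """Return (subword, count) pairs of length-n blocks, most frequent first."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common()


for rank, (word, count) in enumerate(rank_ordered_subwords("abababab", 2), start=1):
    print(rank, repr(word), count)
```

Plotting the count against the rank for n = 4, 9, 16, 25 would give the kind of representation described for Fig. 2.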

Let us still mention that the form of the subword distributions is distinctly not
Zipf-like; it does not follow a power law. On the contrary, with increasing n there
is a tendency to form a Fermi-like plateau. This follows from the theorem of
asymptotic equipartition derived by McMillan and Khinchin. This theorem tells us
that for n → ∞ the asymptotic form of the distribution is rectangular, i.e. the
N*(n) allowed subwords of length n appear with nearly equal frequency. The effects
due to finite n and the effects of finite length L tend to smooth the edges of the
distribution [14]. The importance of length corrections for estimating the frequencies
of subwords was considered by several authors [10, 13]. For a deeper analysis of this
problem we refer to recent articles [14, 17]. Our method for the entropy analysis
uses an extrapolation of the entropy to infinite text length. We mention also a
quite different approach based on the guess of the distribution function for infinite
text length.



The probabilities which we need for the calculation of entropies are unknown
and can only be estimated from the frequencies $N_i(n)$ of the subwords of length n
in a text of length L containing N = L + 1 - n subwords. Introducing the observed
subword frequencies into the entropy definition leads to the observed entropies

$$H_n^{\mathrm{obs}} = \log(N) - \frac{1}{N} \sum_i N_i(n) \cdot \log(N_i(n)). \qquad (7)$$

This is a random variable with the expectation value

$$H_n^{\mathrm{exp}} = \langle H_n^{\mathrm{obs}} \rangle = \log(N) - \frac{1}{N} \sum_i \langle N_i(n) \cdot \log(N_i(n)) \rangle. \qquad (8)$$
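The observed entropy, log N minus the mean of N_i log N_i over all subword positions, is straightforward to compute from the counts. A minimal sketch, assuming natural logarithms and overlapping subwords (the function name is hypothetical):

```python
from collections import Counter
from math import log


def observed_block_entropy(text, n):
    """Observed entropy of length-n subwords: log(N) - (1/N) * sum_i N_i * log(N_i)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())   # N = L + 1 - n subword positions
    if total == 0:
        raise ValueError("text is shorter than the subword length")
    return log(total) - sum(c * log(c) for c in counts.values()) / total


print(observed_block_entropy("abab", 1))  # log(2) for two equally frequent letters
```

For subwords that occur only once the term N_i log N_i vanishes, so once almost all subwords are unique the observed value saturates near log N regardless of the true entropy.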

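Assuming a Bernoulli distribution for the letter combinations, the mean values can
be calculated explicitly [14]. The result is

$$H_n^{\mathrm{exp}} \approx \begin{cases} H_n - \dfrac{N^*(n)}{2N} & \text{if } N^*(n) \ll N \\[1ex] \log(N) - \log(2)\, \dfrac{N}{N^*(n)} & \text{if } N^*(n) \gg N. \end{cases} \qquad (9)$$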

The relation between the effective number of words N*(n) and the block entropy
$H_n$ is given by eq. (4). Hence the expected block entropy may be represented as a
function of log N with one free parameter $H_n$ which is found by fitting the curves.
In this way the block entropies for both books were calculated up to n = 26. For
small word length we used the approximation (9) for $N^*(n) \ll N$ and for larger n
we applied the approximation valid for $N^*(n) \gg N$. In the intermediate region we
applied a smooth Padé approximation between both formulae. In a procedure of
successive approximations the entropy $H_n$ was considered as a free parameter which
was fitted in a way that $H_n^{\mathrm{exp}}(\log N)$ came as close as possible to the measured
(observed) entropy values. In practice this method breaks down for n > 30 if A = 3
and for n > 25 if A = 32. Longer subwords do not have a chance to appear several
times in the text, which leads to large statistical errors.
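The two limiting forms of the expected entropy can be written down directly. The sketch below assumes natural logarithms, so that the effective word number is N* = exp(H_n). The threshold `ratio_cut` that separates the regimes is a hypothetical choice, and the Padé interpolation for the intermediate region is not specified in the text, so it is left out here.

```python
from math import exp, log


def expected_entropy(h_n, n_positions, ratio_cut=10.0):
    """Limiting forms of the expected observed entropy for a text with
    n_positions subword positions and true block entropy h_n (in nats).

    Returns None in the intermediate region, where the text uses a Pade
    interpolation that is not reproduced in this sketch.
    """
    n_star = exp(h_n)  # assumption: effective word number, with h_n in nats
    if n_star * ratio_cut <= n_positions:
        # well sampled regime, N* << N
        return h_n - n_star / (2.0 * n_positions)
    if n_star >= n_positions * ratio_cut:
        # undersampled regime, N* >> N
        return log(n_positions) - log(2.0) * n_positions / n_star
    return None


print(expected_entropy(log(100.0), 10 ** 6))
```

In a fit, h_n would be varied until the predicted values match the observed entropies for several text lengths; that step is omitted here.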

The calculations for n < 26 show that the square root law yields a reasonable
approximation for the scaling of the entropy per letter with the word length n

$$H_n/(n \cdot \log(A)) \approx (4.84/\sqrt{n}) - (7.57/n) \qquad (A = 3)$$
$$H_n/(n \cdot \log(A)) \approx (0.9/\sqrt{n}) + (1.7/n) \qquad (A = 32). \qquad (10)$$

Fig. 3 shows the fit for the alphabet A = 3. The scaling law of the square root type
was first found by Hilberg by fitting Shannon's original data. For n = 100 and
A = 32 our scaling formula gives $H_{100} \approx 10 \cdot \log(A)$, which is not far from Shannon's
estimation $H_{100} \approx 40$ bits.

The number of subwords increases according to a stretched exponential law. For
the law of growth we found the approximation

$$N^* \approx 2^{23.5\sqrt{n} - 35.5} \qquad (A = 3) \qquad (11)$$
$$N^* \approx 2^{4.5\sqrt{n} + 8.5} \qquad (A = 32). \qquad (12)$$
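For A = 32 the square root law for the entropy per letter and the growth law for the word number can be compared numerically. The sketch assumes that the block entropy is expressed in bits, so that its value equals the base-2 logarithm of N*. The function names are hypothetical and the coefficients are the fitted values quoted in the text.

```python
from math import log2, sqrt


def entropy_per_letter_32(n):
    """Fitted entropy per letter in units of log(A) for A = 32."""
    return 0.9 / sqrt(n) + 1.7 / n


def block_entropy_bits_32(n):
    """Block entropy in bits implied by the fitted entropy per letter."""
    return n * log2(32) * entropy_per_letter_32(n)


def log2_word_number_32(n):
    """Base-2 logarithm of the fitted word number N* for A = 32."""
    return 4.5 * sqrt(n) + 8.5


for n in (25, 100):
    print(n, block_entropy_bits_32(n), log2_word_number_32(n))
```

With log2(32) = 5 the two expressions coincide term by term, so the growth law for A = 32 is the square root law rewritten as a word count.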

We summarize now the results obtained for the two books: The scaling of the mutual
information and the entropy per letter shows, in agreement with earlier work, that
long texts are neither periodic nor chaotic but somehow in between. We found
correlations in the range up to $10^3$ positions. The existence of such correlations is
to be seen in the statistics of pairs of letters and of blocks (subwords) of letters.
We developed methods for the calculation of entropies from the given samples of
limited length. Taking into account length corrections we calculated block entropies
up to n = 26 and mutual informations up to distances of a few hundred letters.
Based on these data we formulated a hypothesis about the long range scaling. For
the range $n \gg 100$ the pair correlations contained in the transinformation of long
texts ($L > 10^6$) decay according to a power law; however, the differences to Bernoulli
samples of the same length are rather small. A reliable estimation of the block
entropies for n > 30 is still an open question. The results for the entropy of the
two books suggest, in agreement with Shannon's data and Hilberg's findings, that
the mean entropy per letter decays to zero according to a square root law. As a
consequence the number of different subwords in texts increases with the number of
letters n according to a stretched exponential law. Our estimations for the growth
yield for n = 100 a total number of about $2^{53}$ different subwords. In spite of the fact
that this number is very large, it is indeed small in comparison to Bernoulli strings
where $2^{108}$ different subwords of length n = 100 exist. In this way we observe a
very strong selection among the combinatorial possibilities. Most of the subwords
which would be possible from the combinatorial point of view are actually forbidden
and do not appear in real texts. We investigated also how the number of genuine
English words N(L) (formally defined here as sequences of letters between spaces
and/or marks) increases with the length L of a text. For Grimm's Tales we found
the scaling law $N(L) = 22.8 \cdot L^{0.46}$. In other words, with increasing length
new words are always introduced; no saturation with text length could be observed.
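The growth of the vocabulary with text length can be measured by counting distinct words in growing prefixes of the text. The sketch below assumes that a word is a maximal run of the letters a to z after lower-casing, which is one possible reading of "sequences of letters between spaces and/or marks". The function names and the sampling step are hypothetical.

```python
import re


def vocabulary_growth(text, step):
    """Return (L, N(L)) pairs: number of distinct words that end within the
    first L characters, sampled every `step` characters."""
    seen = set()
    points = []
    next_mark = step
    for match in re.finditer(r"[a-z]+", text.lower()):
        while match.end() > next_mark:
            points.append((next_mark, len(seen)))
            next_mark += step
        seen.add(match.group())
    while next_mark < len(text):
        points.append((next_mark, len(seen)))
        next_mark += step
    points.append((len(text), len(seen)))
    return points


def grimm_scaling_law(length):
    """Fitted scaling law N(L) = 22.8 * L**0.46 quoted for Grimm's Tales."""
    return 22.8 * length ** 0.46


print(vocabulary_growth("the cat and the dog and the cat", 10))
```

A straight line in a log-log plot of the measured points would indicate a power law of this kind. The fitted constants apply only to the text they were obtained from.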

More empirical data on long texts and further studies of the statistical effects 
due to finite length of the samples are needed in order to reach a more definite 
conclusion about the scaling properties. 

The authors thank K. Albrecht, J. Freund, H. Herzel, G. Nicolis and A. Schmitt 
for fruitful discussions and collaboration on several tasks of the methods leading to 
the results presented here. 
Figure 1: The mutual information calculated for Melville's Moby Dick and for
Grimm's Tales (A = 32, n < 25).



Figure 2: The observed rank ordered distribution of words of length n = 4, 9, 16, 25 
for Grimm's Tales. 



Figure 3: The scaling behaviour of the block entropy $H_n$ with the square root of the
word length n for Moby Dick encoded by the alphabet A = 3.



